Installation

try:
    import maestro
except ModuleNotFoundError:
    %pip install "maestro[florence_2]"

Imports

# Config
import os
import numpy as np
from tqdm.notebook import tqdm
from roboflow import Roboflow
from maestro.trainer.common.datasets import RoboflowJSONLDataset
from maestro.trainer.models.florence_2.inference import predict
from maestro.trainer.models.florence_2.checkpoints import (
    OptimizationStrategy, load_model)
import supervision as sv
from PIL import Image
from dotenv import load_dotenv
load_dotenv()
True
Ice Breaker
- VLMs are useful for object detection in zero-shot and fine-tuning settings.
- Unlike traditional object detection models like YOLO, VLMs are sensitive to the class names due to the “language” part of the model.
- In other words, VLMs are fundamentally made to connect the visual and language domains.
- Thus, a question naturally arises:
- How does VLM performance vary with the choice of class names?
Build fine-tuning pipeline for Florence-2
This part is adapted from this Roboflow notebook.
Download dataset
ROBOFLOW_API_KEY = os.getenv('ROBOFLOW_API_KEY')
rf = Roboflow(api_key=ROBOFLOW_API_KEY)
project = rf.workspace("roboflow-jvuqo").project("poker-cards-fmjio")
version = project.version(4)
dataset = version.download("florence2-od", "/tmp/poker-cards-fmjio")
loading Roboflow workspace...
loading Roboflow project...
!head -n 1 {dataset.location}/train/annotations.jsonl
{"image":"IMG_20220316_172418_jpg.rf.e3cb4a86dc0247e71e3697aa3e9db923.jpg","prefix":"<OD>","suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>jack of clubs<loc_566><loc_166><loc_823><loc_432>queen of clubs<loc_365><loc_465><loc_765><loc_999>king of clubs<loc_601><loc_440><loc_949><loc_873>"}
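Each JSONL record pairs an image with a `prefix` task token (`<OD>`) and a `suffix` that encodes labels and boxes as `<loc_N>` tokens. As a rough sketch of how that suffix can be decoded (assuming the usual Florence-2 convention of 0–999 location bins scaled to the image size; `parse_florence_od_suffix` is a hypothetical helper, not part of maestro):

```python
import json
import re

def parse_florence_od_suffix(suffix, image_width, image_height):
    """Decode a Florence-2 <OD> suffix into (label, box) pairs.

    Assumes each object is a label followed by four <loc_N> tokens
    (x1, y1, x2, y2), with N binned to 0-999 and scaled to pixels
    as N / 999 * image size.
    """
    # A label is any run of non-'<' characters followed by exactly
    # four consecutive <loc_N> tokens.
    pattern = re.compile(r"([^<]+)((?:<loc_\d+>){4})")
    detections = []
    for label, locs in pattern.findall(suffix):
        x1, y1, x2, y2 = (int(n) for n in re.findall(r"<loc_(\d+)>", locs))
        box = (
            x1 / 999 * image_width,
            y1 / 999 * image_height,
            x2 / 999 * image_width,
            y2 / 999 * image_height,
        )
        detections.append((label.strip(), box))
    return detections

# Decode a record shaped like the annotation line shown above.
record = json.loads(
    '{"image":"example.jpg","prefix":"<OD>",'
    '"suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>'
    '10 of clubs<loc_388><loc_145><loc_670><loc_453>"}'
)
detections = parse_florence_od_suffix(record["suffix"], 1000, 1000)
print(detections[0])  # first (label, box) pair
```

This is the same representation the model is trained to emit, which is why class names live inside the text stream rather than in a fixed label map.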
| Command | Type | Description |
|---|---|---|
| --dataset | TEXT | Path to the dataset used for training [default: None] [required] |
| --model_id | TEXT | Identifier for the Florence-2 model [default: microsoft/Florence-2-base-ft] |
| --revision | TEXT | Model revision to use [default: refs/pr/20] |
| --device | TEXT | Device to use for training [default: auto] |
| --optimization_strategy | TEXT | Optimization strategy: lora, freeze, or none [default: lora] |
| --cache_dir | TEXT | Directory to cache the model weights locally [default: None] |
| --epochs | INTEGER | Number of training epochs [default: 10] |
| --lr | FLOAT | Learning rate for training [default: 1e-05] |
| --batch_size | INTEGER | Training batch size [default: 4] |
| --accumulate_grad_batches | INTEGER | Number of batches to accumulate for gradient updates [default: 8] |
| --val_batch_size | INTEGER | Validation batch size [default: None] |
| --num_workers | INTEGER | Number of workers for data loading [default: |
| --val_num_workers | INTEGER | Number of workers for validation data loading [default: None] |
| --output_dir | TEXT | Directory to store training outputs [default: ./training/florence_2] |
| --metrics | TEXT | List of metrics to track during training |
| --max_new_tokens | INTEGER | Maximum number of new tokens generated during inference [default: 1024] |
| --random_seed | INTEGER | Random seed for reproducibility; if None, no seed is set [default: None] |
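One detail worth noting in the defaults above: with gradient accumulation, the optimizer steps only once every `accumulate_grad_batches` forward/backward passes, so the update behaves as if it were computed on a larger batch. A quick check of what the defaults imply:

```python
# Defaults taken from the options table above.
batch_size = 4
accumulate_grad_batches = 8

# Effective batch size seen by each optimizer update:
# gradients from 8 batches of 4 images are summed before stepping.
effective_batch_size = batch_size * accumulate_grad_batches
print(effective_batch_size)  # 32
```

This matters when tuning `--lr`: halving `--batch_size` while doubling `--accumulate_grad_batches` keeps the effective batch size (and thus a comparable learning-rate regime) while reducing peak GPU memory.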